Interval Comparison #5180
Conversation
I'm not sure about this; intervals cannot be accurately converted to a number of milliseconds, etc. Whilst the previous ordering might have been surprising, it is a total order that matches the equality relation; this PR is neither. Perhaps we might take a step back as to what you're hoping to achieve with this?
```diff
@@ -50,6 +50,8 @@ pub enum DataType {
     Int32,
     /// A signed 64-bit integer.
     Int64,
+    /// A signed 128-bit integer.
+    Int128,
```
I don't think we can unilaterally add array types to the arrow specification
Right, I agree that any order should match the equality relation, and there is a bug here. We will discuss with @berkaysynnada shortly.
Sure, thanks for offering to help. Basically, the issue we are facing is that the current total order results in wrong branches being taken during pruning operations in downstream use cases (like DataFusion). We are trying to arrive at a partial order that is compatible with equality and only returns true when a comparison certainly holds.
I'm not sure such a relation is possible, or at least it is not trivially obvious to me. Parquet, for example, does not produce interval statistics at all (although arrow-rs has historically been writing them, fixed by #5147) - https://github.com/apache/parquet-format/blob/066f9817332da32bdc6dc6dea833b6ee9c269934/LogicalTypes.md?plain=1#L537-L539 A couple of thoughts:
Yes, this was surprising to us too. The issue is that there is nothing barring a user from having a column whose type is an interval and writing an expression involving it. In such cases, range analysis gets things wrong because of the ordering issue.
Agreed, definitely not trivial. We will try to iterate on this to see if we can arrive at such a relation -- and if we fail, we will learn something and can come up with a plan B. @berkaysynnada will post an update here today after we think about this a little more.
Perhaps it could just bail out or something... If it isn't an expected use case, perhaps we can just fall back to not doing range analysis - I have to confess to not having a good grasp of when or how frequently such a situation would arise.
(force-pushed from bc4967e to 9f9382e)
I've revised the implementation, and the comparison semantics are now as follows: the interval values on both sides of the comparison work as if they reference a common timestamp. If comparing these intervals with respect to this reference gives a definite answer, as with 1 month and 1 month + 1 day, that answer is given as the result. However, in indefinite cases, like 1 month and 30 days, the result will be false for both greater-than and less-than comparisons. This means that comparisons such as "1 month < 1 month + 1 day" now return true. The current status of this PR includes:
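A minimal sketch of the semantics described above (hypothetical names and struct; not the PR's actual code): compare two intervals through their *difference*, mapped to the range of real durations it may span (treating a month as 28 to 31 days), and return true only when that whole range lies on one side of zero.

```rust
// Hypothetical sketch of the proposed partial-order semantics; the real
// arrow-rs interval representation differs in detail.
const NANOS_PER_DAY: i128 = 86_400_000_000_000;
const MIN_DAYS_PER_MONTH: i128 = 28;
const MAX_DAYS_PER_MONTH: i128 = 31;

#[derive(Clone, Copy)]
pub struct MonthDayNano {
    pub months: i32,
    pub days: i32,
    pub nanos: i64,
}

/// Inclusive range of nanoseconds the interval may represent,
/// given that a month can span 28..=31 days.
fn range_nanos(iv: MonthDayNano) -> (i128, i128) {
    let fixed = iv.days as i128 * NANOS_PER_DAY + iv.nanos as i128;
    let a = iv.months as i128 * MIN_DAYS_PER_MONTH * NANOS_PER_DAY + fixed;
    let b = iv.months as i128 * MAX_DAYS_PER_MONTH * NANOS_PER_DAY + fixed;
    (a.min(b), a.max(b))
}

/// True only if `a < b` holds for every possible realization of both
/// intervals relative to a common reference timestamp.
pub fn definitely_lt(a: MonthDayNano, b: MonthDayNano) -> bool {
    let diff = MonthDayNano {
        months: b.months - a.months,
        days: b.days - a.days,
        nanos: b.nanos - a.nanos,
    };
    // Definite only if the smallest possible difference is still positive.
    range_nanos(diff).0 > 0
}

fn main() {
    let one_month = MonthDayNano { months: 1, days: 0, nanos: 0 };
    let month_and_day = MonthDayNano { months: 1, days: 1, nanos: 0 };
    let thirty_days = MonthDayNano { months: 0, days: 30, nanos: 0 };

    // Definite: 1 month < 1 month + 1 day.
    assert!(definitely_lt(one_month, month_and_day));
    // Indefinite: 1 month vs 30 days is false in both directions.
    assert!(!definitely_lt(one_month, thirty_days));
    assert!(!definitely_lt(thirty_days, one_month));
}
```

Under these sketched semantics, the indeterminate pair returns false for both `<` and `>`, which is the behavior debated in the rest of the thread.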
Thanks @berkaysynnada. @tustvold, what do you think about these semantics? They seem to improve the behavior without breaking any use case, but let's try to poke holes and make sure.
Whilst I appreciate you working on this, I feel I wasn't clear enough before:
From my perspective, this seems like it is trying to frame range analysis as an ordering relation, which seems somewhat contrived. Perhaps this logic could be implemented within the range analysis framework of DataFusion, as opposed to here?
AFAICT the way the proposed ordering relation works will not break any sorts. Since indeterminate cases return false in both directions, it will act like a pseudo total order. This is what I'm trying to nail down by discussing with you -- if it breaks sorts, I agree it is a no-go.
At this point, we are trying to get the code to compile so we can iterate to find the right semantics. The purpose here is not to rush a merge or make a de-facto change to the arrow spec. The code may very well move elsewhere if this is not the right place at this time. My personal take is that returning clearly wrong comparison results for interval datatypes should be fixed at this level. The library claims to support these datatypes, after all, and performs such comparisons without any error or warning, so silently returning arbitrary (relative to real-world interval semantics) comparison results is simply wrong IMHO. However, it doesn't really matter at this point -- we can always revisit such decisions over time. Let's try to find a good solution together first, and we can place the code somewhere else to get things moving.
Currently the following is always true regardless of input
Or to put it differently, the arrow library uses a consistent total ordering relation across its kernels, including things like the row format. I agree the ordering is surprising and may violate people's expectations, but this is a general problem for intervals, where even basic arithmetic axioms don't hold, e.g.
This will indeed be false with the proposed semantics. I understand that you want to preserve such invariants, though I'd say it's arguable whether users want an invariant that the underlying datatype doesn't really admit under its natural ordering semantics. The price to pay (unnatural comparisons) is quite heavy. There is unfortunately no perfect solution here; the interval data type is too weird 🙁
Such basic arithmetic axioms do not hold for floating-point values either, but they still have sensible ordering semantics (comparisons coincide with their natural meaning for non-NaN cases, and indeterminate comparisons involving NaNs return false in all directions). So if you are pointing out yet another weirdness of the interval type, I agree with you -- but these points are not logically related and do not support/normalize each other. Let's fix this in DataFusion in the short term, and we can revisit an upstream solution later when we have more clarity. Maybe you or we will have even better ideas/semantics over time.
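For reference, the NaN behavior described above is exactly how Rust's built-in float comparison operators work: any ordered comparison involving NaN is false, while non-NaN values compare by their natural order.

```rust
fn main() {
    let nan = f64::NAN;
    // Ordered comparisons involving NaN return false in every direction,
    // and NaN is not even equal to itself.
    assert!(!(nan < 1.0));
    assert!(!(nan > 1.0));
    assert!(nan != nan);
    // Non-NaN values keep their natural ordering.
    assert!(1.0_f64 < 2.0);
}
```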
FWIW we use IEEE total ordering for floats, which don't behave as you describe - https://docs.rs/arrow-ord/latest/arrow_ord/cmp/fn.gt.html |
It actually does behave the way I describe. Anyway, it doesn't matter; we will do the fix downstream. Thanks for the discussions, which facilitated improved semantics for this.
We do indeed, as documented in the link |
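The IEEE 754 total ordering referred to in the linked docs is exposed in Rust as `f64::total_cmp`, which, unlike the partial `<`/`>` operators, makes NaN comparable (sorting after positive infinity) rather than incomparable:

```rust
use std::cmp::Ordering;

fn main() {
    // total_cmp yields a total order, so every pair of values is
    // comparable and sorting never needs to special-case NaN.
    let mut v = vec![f64::NAN, 1.0, f64::NEG_INFINITY, -0.0, 0.0];
    v.sort_by(|a, b| a.total_cmp(b));
    // NaN (with a positive sign bit) sorts after everything else.
    assert!(v.last().unwrap().is_nan());
    assert_eq!(f64::NAN.total_cmp(&f64::INFINITY), Ordering::Greater);
    // -0.0 sorts strictly before +0.0 in the total order.
    assert_eq!((-0.0f64).total_cmp(&0.0), Ordering::Less);
}
```

This mirrors the trade-off under discussion: the kernels pick a consistent total order over the "natural" partial one.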
I know this PR is closed, but I would like to offer two comments:
I would like to (and I will) add a test that documents this assumption/feature in code so that we don't end up in a situation like this again, where a proposed change breaks a non-obvious invariant.
Thank you, Andrew. I think the broader invariant is that we order values consistently across kernels, so things like
Right -- I was looking for a counter-example that would sort differently under the two different semantics. Maybe something like the following. Sort order on master (even though "logically" one might expect
But if you changed the comparison to the possibly more intuitive ordering:
then the order doesn't match the order of the underlying i128 🤔
I filed #5192 with tests and documentation explaining the current behavior and my understanding of the rationale. |
After further study of this matter, I think it could make sense to add new comparison kernels that are specific to intervals and implement the partial-order interval comparisons as proposed in this PR. Something like:

```rust
let arr1 = IntervalMonthDayNanoArray::from(...);
let arr2 = IntervalMonthDayNanoArray::from(...);
// compare arr1 and arr2 using interval-specific logic / partial order
let res = interval_partial_lt(&arr1, &arr2);
```
Perhaps I might suggest filing a ticket to discuss what the desired semantics are, linking back to the one in DF, as the first step before jumping into code. In particular, I am curious what systems like postgres do. In general, I would like to encourage this pattern of working, as it can help avoid frustration and ensure all the necessary context is understood before work gets underway. Filing subtle and/or large PRs with no prior discussion is not setting things up for success.
Yes, this is an excellent idea -- I will do so. I agree that for items where the semantics may not be clear, it often helps to start with a ticket rather than a PR |
I have filed #5200 to discuss how to handle interval comparisons (if at all) |
Which issue does this PR close?
Closes #.
Rationale for this change
When intervals are compared, `IntervalDayTimeArray` and `IntervalMonthDayNanoArray` exhibit unexpected behavior. This occurs because they are treated as signed integers, but they require special handling since they consist of multiple contiguous time-unit blocks.

The first unexpected behavior is observed when the LSD parts of these intervals are negative numbers and the MSD's are 0; under these conditions, intervals with negative values are evaluated as if they are greater than those with positive values. This test also fails in a similar manner.

The second problem arises when comparing relative intervals, such as 1 month and 30 days. In such cases, we cannot arrive at an absolute result (e.g. the number of days in a month can change). Therefore, to stay on the safe side, we adopt the following semantics: a comparison returns true if it certainly holds; otherwise, it returns false. Going back to our example, we return false for both "1 month < 30 days" and "1 month > 30 days". It is impossible to impose a total order on intervals, but we can impose a consistent partial order with this logic.
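The first problem can be illustrated with a simplified sketch (a hypothetical packing for explanation only; the real arrow-rs layout differs in detail): when a negative low-order field is packed into a wider integer, its sign-extension bits read as a large positive magnitude under raw signed comparison.

```rust
// Hypothetical simplified layout: pack months|days|nanos into one i128
// and compare the result as a plain signed integer.
fn pack(months: i32, days: i32, nanos: i64) -> i128 {
    ((months as i128) << 96)
        | (((days as u32) as i128) << 64)
        | ((nanos as u64) as i128)
}

fn main() {
    // months = 0, days = 0, nanos = -1: the sign bits of the nanosecond
    // field land in the low 64 bits of a positive i128, so "-1 ns"
    // compares *greater* than "+1 ns" under raw integer comparison.
    assert!(pack(0, 0, -1) > pack(0, 0, 1));
    // Ordinary non-negative values still compare as expected.
    assert!(pack(0, 0, 1) > pack(0, 0, 0));
}
```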
What changes are included in this PR?

`IntervalDayTimeArray` and `IntervalMonthDayNanoArray` are separately handled before passing to `apply()` in `compare_op()`. They are represented in milliseconds and nanoseconds, respectively, as the maximum or minimum value that the interval may correspond to.

As we need an exact representation of `IntervalMonthDayNano` in nanoseconds to prevent overflow cases, a variant for a 128-bit signed integer is added to `DataType`. Necessary extensions are implemented accordingly.

Are there any user-facing changes?